Use helpers in your recordExtractor to make it easier to extract relevant content from your page.

Algolia has a selection of helpers:

  • product
  • article
  • page
  • splitContentIntoRecords
  • codeSnippets
  • docsearch

product

This helper extracts content from product pages. A “product page” is an HTML page with a JSON-LD Product schema.

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.product({ url, $ });
}

Response

The helper returns an object with the following properties:

objectID
string

The product page’s URL.

url
string

The product page’s URL (without parameters or hashes).

lang?
string

The language the page content is written in (from the HTML lang attribute).

name
string

The name field of the JSON-LD product schema.

sku
string

The sku field of the JSON-LD schema.

description?
string

The description field of the JSON-LD schema.

image?
string

The image field of the JSON-LD schema.

price?
string

The product’s price, selected from the first of these JSON-LD schema fields that is present, in this order:

  1. offers.price
  2. offers.highPrice
  3. offers.lowPrice

currency?
string

The offers.priceCurrency field of the JSON-LD schema.

category?
string

The category field of the JSON-LD schema.
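
If the helper’s output doesn’t cover everything you need, you can add your own attributes on top of it. A minimal sketch, assuming a hypothetical .breadcrumb list in your product page markup:

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  const product = helpers.product({ url, $ });
  // Hypothetical: collect breadcrumb text, assuming a .breadcrumb list in your markup
  const breadcrumbs = $('.breadcrumb li')
    .map((i, el) => $(el).text().trim())
    .get();
  return { ...product, breadcrumbs };
}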

article

This helper extracts content from article pages. An “article page” is an HTML page with an article JSON-LD schema or an equivalent meta tag.

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.article({ url, $ });
}

Response

The helper returns an object with the following properties:

objectID
string

The article’s URL.

url
string

The article’s URL (without parameters or hashes).

lang?
string

The language the article is written in (from the HTML lang attribute).

headline
string

The article’s headline, selected from the first of these that is present, in this order:

  1. meta[property="og:title"]
  2. meta[name="twitter:title"]
  3. head > title
  4. The first <h1>

description?
string

The article’s description, selected from the first of these that is present, in this order:

  1. meta[name="description"]
  2. meta[property="og:description"]
  3. meta[name="twitter:description"]

keywords
string array

The keywords field of the JSON-LD schema.

tags
string array

Article tags: meta[property="article:tag"].

image?
string

The image associated with the article, selected from the first of these that is present, in this order:

  1. meta[property="og:image"]
  2. meta[name="twitter:image"]

authors?
string array

The author field of the JSON-LD schema.

datePublished?
string

The datePublished field of the JSON-LD schema.

dateModified?
string

The dateModified field of the JSON-LD schema.

category?
string

The category field of the JSON-LD schema.

content
string

The article’s content (body copy).
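
You can also post-process the helper’s output before returning it. A minimal sketch that skips pages without a headline (the filter itself is an assumption about what you want to index):

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  const article = helpers.article({ url, $ });
  // Hypothetical filter: returning an empty array creates no records for this page
  if (!article.headline) {
    return [];
  }
  return article;
}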

page

This helper extracts text from any page, regardless of its type or category.

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.page({
    url,
    $,
    recordProps: {
      title: 'head title',
      content: 'body',
    },
  });
}

Response

The helper returns an object with the following properties:

objectID
string

The object’s unique identifier.

url
string

The page’s URL.

hostname
string

The URL hostname (for example, example.com).

path
string

The URL path: everything after the hostname.

depth
number

The URL depth, based on the number of slashes after the domain. For example, http://example.com/ = 1, http://example.com/about = 1, http://example.com/about/ = 2.

fileType
file type

The page’s file type. One of: html, xml, json, pdf, doc, xls, ppt, odt, ods, odp, or email.

contentLength
number

The page length in bytes.

title?
string

The page title, derived from head > title.

description?
string

The page’s description, derived from meta[name="description"].

keywords?
string array

The page’s keywords, derived from meta[name="keywords"].

image?
string

The image associated with the page, derived from meta[property="og:image"].

headers?
string array

The page’s section titles, derived from h1 and h2.

content
string

The page’s content (body copy).
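
The recordProps selectors control where the title and content come from, so you can narrow them to leave out navigation and footer text. A minimal sketch, assuming your pages wrap their body copy in a main element:

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.page({
    url,
    $,
    recordProps: {
      title: 'head title',
      // Restrict the extracted body copy to the main element instead of the whole body
      // (the selector is an assumption about your markup)
      content: 'main',
    },
  });
}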

splitContentIntoRecords

This helper extracts text from long HTML pages and splits them into smaller chunks. This can help prevent “Record too big” errors.

Using this example record extractor on a long page returns an array of records, each one smaller than 1,000 bytes.

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  const baseRecord = {
    url,
    title: $('head title').text().trim(),
  };
  const records = helpers.splitContentIntoRecords({
    baseRecord,
    $elements: $('body'),
    maxRecordBytes: 1000,
    textAttributeName: 'text',
    orderingAttributeName: 'part',
  });
  // Produced records can be modified after creation, if necessary.
  return records;
}

When splitting pages, the same words can appear in several records from the same page. If you don’t want these duplicates to show up when users search:

  • Set distinct to true in your index. For example, distinct: true.
  • Set attributeForDistinct to your page’s URL. For example, attributeForDistinct: 'url'.
  • Set searchableAttributes to your page title and body content. For example, searchableAttributes: [ 'title', 'text' ].
  • Add a customRanking to sort from the first split record on your page to the last. For example, customRanking: [ 'asc(part)' ].

JavaScript
initialIndexSettings: {
  'my-index': {
    distinct: true,
    attributeForDistinct: 'url',
    searchableAttributes: [ 'title', 'text' ],
    customRanking: [ 'asc(part)' ],
  }
}

Parameters

Specify one or more of these parameters in your helper call to control how the records are split.

baseRecord
record
default:"{}"

Takes this record’s attributes (and values) and adds them to all the split records.

$elements
string
default:"$('body')"

A Cheerio selector that determines from which elements content will be extracted. For more information, see Extracting data with Cheerio.

maxRecordBytes
number
default:"10000"

Maximum number of bytes allowed per record. To avoid errors, check your plan’s record size limits.

orderingAttributeName
string

This attribute stores the sequentially generated number assigned to each record when the helper splits a page.

textAttributeName
string
default:"text"

Name of the attribute in which to store the text of each split record.
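
Since the helper returns an array of records, you can enrich each one after it’s created. A minimal sketch that copies the page language onto every split record (the lang attribute lookup and the 'en' fallback are assumptions):

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  const records = helpers.splitContentIntoRecords({
    baseRecord: { url, title: $('head title').text().trim() },
    $elements: $('body'),
    maxRecordBytes: 10000,
  });
  // Copy the page-level language onto every split record (the 'en' fallback is arbitrary)
  const lang = $('html').attr('lang') || 'en';
  return records.map((record) => ({ ...record, lang }));
}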

codeSnippets

Use this helper to extract code snippets from pages. The helper finds code snippets by looking for <pre> tags and extracting the content and the language class prefix from the tag.

If the crawler finds several code snippets on a page, the helper returns a list of those snippets.

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  // These option values are illustrative; adjust them to match your markup
  const code = helpers.codeSnippets({ tag: 'pre', languageClassPrefix: 'language-' });
  return { code };
}

Response

The helper returns an array of code objects with the following properties:

content
string

The code snippet.

languageClassPrefix?
string

The code snippet’s language (if found).

codeUrl?
string

The URL of the nearest sibling <a> tag.

fragmentUrl?
string

A text fragment URL for the code snippet. A text fragment links directly to a selection of text within a page.
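
Because this helper only returns the snippets, you can attach them to a record built by another helper. A minimal sketch combining it with the page helper (the option values passed to codeSnippets are illustrative):

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  const page = helpers.page({
    url,
    $,
    recordProps: { title: 'head title', content: 'body' },
  });
  // Option values are illustrative; adjust them to your markup
  const code = helpers.codeSnippets({ tag: 'pre', languageClassPrefix: 'language-' });
  return { ...page, code };
}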

docsearch

This helper extracts content and formats it to be compatible with DocSearch. It creates an optimized number of records for relevancy and hierarchy.

You can also use it without DocSearch or to index non-documentation content. For more information, see the DocSearch documentation.

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.docsearch({
    aggregateContent: true,
    indexHeadings: true,
    recordVersion: 'v3',
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
}
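
You can also branch on the page URL before calling the helper. A minimal sketch that skips a hypothetical /changelog/ section (the path check and the trimmed-down recordProps are examples, not DocSearch requirements):

JavaScript
recordExtractor: ({ url, $, helpers }) => {
  // Hypothetical filter: skip a /changelog/ section; an empty array creates no records
  if (String(url).includes('/changelog/')) {
    return [];
  }
  return helpers.docsearch({
    aggregateContent: true,
    recordProps: {
      lvl0: { selectors: "header h1" },
      lvl1: "article h2",
      content: "main p, main li",
    },
  });
}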